Exploratory analysis of digits dataset

1. Import packages



In [2]:

    
%matplotlib inline
# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Load and inspect data



In [6]:

    
train = pd.read_csv('../data/train.csv')



In [7]:

    
train.head()









    Out[7]:

   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  ...  pixel774  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783
0      1       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
1      0       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
2      1       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
3      4       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
4      0       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0

5 rows × 785 columns



In [63]:

    
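# Pixel values (label column dropped) for row 3 (a 4) and row 2 (a 1).
# .iloc replaces .ix, which has been removed from pandas.
x_4 = train.iloc[3].values[1:]
x_1 = train.iloc[2].values[1:]
# Repeat the single grayscale channel three times to build 28x28x3 RGB arrays
four = np.vstack([x_4, x_4, x_4]).T.reshape(28, 28, 3)
one = np.vstack([x_1, x_1, x_1]).T.reshape(28, 28, 3)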



In [64]:

    
show1 = plt.imshow(one)



In [65]:

    
show4 = plt.imshow(four)

This tells us:

the image data represent the edges (strokes) of handwritten digits
the edges are not always complete
pixel intensity values range from 0 to 255
each image array has only one channel to deal with
each image is 28 by 28 pixels (784 pixel columns per row)
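
The size and intensity observations above can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the original notebook: `summarize_pixels` is a hypothetical helper, and `demo` is a random stand-in with the same layout as `train` (a `label` column followed by 784 pixel columns), so the block runs without the CSV file.

```python
import numpy as np
import pandas as pd

def summarize_pixels(df, label_col="label"):
    """Return basic facts about the pixel columns of a digits-style frame.

    Hypothetical helper: assumes every column except `label_col` is a pixel.
    """
    pixels = df.drop(columns=[label_col]).to_numpy()
    n_pixels = pixels.shape[1]
    side = int(round(np.sqrt(n_pixels)))
    return {
        "n_pixels": n_pixels,                  # pixel columns per row
        "side": side,                          # candidate image width/height
        "is_square": side * side == n_pixels,  # True if rows reshape to side x side
        "min": int(pixels.min()),              # smallest intensity seen
        "max": int(pixels.max()),              # largest intensity seen
    }

# Stand-in for `train`: two random rows in the same layout (label + 784 pixels).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(0, 256, size=(2, 784)),
    columns=[f"pixel{i}" for i in range(784)],
)
demo.insert(0, "label", [1, 4])

summary = summarize_pixels(demo)
print(summary)
```

Calling `summarize_pixels(train)` on the real data would show whether the 0 to 255 range and the 28 by 28 shape hold for every row, not only for the two images displayed.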

And this raises some questions:

are the digits generally positioned in the center of the image?
- if not, classification might be improved by detecting the 'center of mass' of each digit; starting from there could be useful
are the digits generally positioned upright?
- row 0's 1 was at quite an angle. Could the analysis benefit from some kind of normalization that sets straight lines to vertical?
are the digits in the images generally the same size?
does the number of nonzero pixels correlate well with the digit, or with a type of digit?
if I run a simple linear regression on the rows here, do any of these questions matter?
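
Two of these questions, centering and nonzero-pixel counts, can be explored with a few lines of NumPy and pandas. This is a rough sketch under stated assumptions rather than the analysis itself: `center_of_mass` and `ink_count` are hypothetical helper names, and `demo` is a synthetic two-row frame in the same layout as `train`, with arbitrary labels, so the block runs without the CSV file.

```python
import numpy as np
import pandas as pd

SIDE = 28  # image width and height, as observed in the notes above

def center_of_mass(pixels, side=SIDE):
    """Intensity-weighted (row, col) centroid of one flattened image."""
    img = np.asarray(pixels, dtype=float).reshape(side, side)
    total = img.sum()
    if total == 0:
        # A blank image has no well-defined centroid.
        return (float("nan"), float("nan"))
    rows, cols = np.indices(img.shape)
    return (float((rows * img).sum() / total),
            float((cols * img).sum() / total))

def ink_count(pixels):
    """Number of nonzero pixels in one flattened image."""
    return int(np.count_nonzero(pixels))

# Synthetic stand-in for `train`: one image with a 4x4 block of ink placed
# off-center, and one blank image. The labels are arbitrary placeholders.
img = np.zeros((SIDE, SIDE), dtype=int)
img[10:14, 5:9] = 255
blank = np.zeros(SIDE * SIDE, dtype=int)
demo = pd.DataFrame([img.ravel(), blank],
                    columns=[f"pixel{i}" for i in range(SIDE * SIDE)])
demo.insert(0, "label", [4, 1])

# Centroid of the first image; a centered digit would sit near (13.5, 13.5).
centroid = center_of_mass(demo.iloc[0].values[1:])

# Mean number of nonzero pixels per label.
ink = demo.drop(columns="label").gt(0).sum(axis=1)
ink_by_label = ink.groupby(demo["label"]).mean()

print(centroid)
print(ink_by_label)
```

On the real data, the distance between each centroid and the image center (13.5, 13.5) would indicate how well centered the digits are, and `ink_by_label` would give a first look at whether the nonzero-pixel count separates the labels.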
